## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. :0.900
## 1st Qu.: 7.100 1st Qu.:0.3950 1st Qu.:0.0900 1st Qu.:1.900
## Median : 7.900 Median :0.5200 Median :0.2500 Median :2.200
## Mean : 8.259 Mean :0.5288 Mean :0.2661 Mean :2.409
## 3rd Qu.: 9.100 3rd Qu.:0.6400 3rd Qu.:0.4200 3rd Qu.:2.600
## Max. :13.200 Max. :1.5800 Max. :1.0000 Max. :8.300
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 21.25
## Median :0.07900 Median :13.00 Median : 37.00
## Mean :0.08699 Mean :15.17 Mean : 44.52
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 60.00
## Max. :0.61100 Max. :46.00 Max. :144.00
## density pH sulphates alcohol quality
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 3: 10
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 4: 52
## Median :0.9967 Median :3.310 Median :0.6200 Median :10.20 5:650
## Mean :0.9967 Mean :3.316 Mean :0.6569 Mean :10.43 6:614
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7275 3rd Qu.:11.10 7:190
## Max. :1.0029 Max. :4.010 Max. :2.0000 Max. :14.00 8: 18
## 'data.frame': 1534 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
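The change of `quality` from `int` to `Ord.factor` in the structure listing above can be reproduced with `factor(..., ordered = TRUE)`. This is a minimal sketch on a small made-up vector, not necessarily the code used for this report; `quality_demo` and `quality_ord` are hypothetical names.

```r
# Toy ratings standing in for the quality column (made-up values)
quality_demo <- c(5L, 5L, 6L, 7L, 3L, 8L, 4L, 6L)

# Convert to an ordered factor so that "3" < "4" < ... < "8"
quality_ord <- factor(quality_demo, levels = 3:8, ordered = TRUE)

str(quality_ord)    # reports an Ord.factor with 6 levels
table(quality_ord)  # counts per rating level
```

Treating quality as an ordered factor lets plots and summaries group by rating while still respecting the ranking of the levels.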
Now let’s see how the data are distributed for each variable in the following histograms:
From these histograms we can get a general idea of the dataset. For example, the quality histogram shows that the most common ratings are 5 and 6, with 5 the most frequent. In addition, 8 is the highest rating in the dataset.
To reduce skew and bring some of the distributions above closer to normal, we apply a logarithmic (log10) transformation. Total sulfur dioxide, fixed acidity and sulphates have long-tailed distributions, so to make them easier to read on the graph we do the following:
## nbr.val nbr.null nbr.na min max
## 1.534000e+03 0.000000e+00 0.000000e+00 3.300000e-01 2.000000e+00
## range sum median mean SE.mean
## 1.670000e+00 1.007700e+03 6.200000e-01 6.569100e-01 4.339978e-03
## CI.mean.0.95 var std.dev coef.var
## 8.512921e-03 2.889351e-02 1.699809e-01 2.587583e-01
## nbr.val nbr.null nbr.na min max
## 1.534000e+03 1.000000e+00 0.000000e+00 -4.814861e-01 3.010300e-01
## range sum median mean SE.mean
## 7.825161e-01 -2.979098e+02 -2.076083e-01 -1.942046e-01 2.476082e-03
## CI.mean.0.95 var std.dev coef.var
## 4.856866e-03 9.404925e-03 9.697899e-02 -4.993651e-01
The long-tailed sulphates data were log10-transformed to reduce skew. The transformed values are approximately normally distributed: the variance drops from 0.029 to 0.0094 and the histogram looks more symmetric.
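The effect of a log10 transformation on a long-tailed variable can be checked numerically as well as visually. The sketch below uses synthetic right-skewed values rather than the wine data, and the `skewness` helper is a hypothetical function written out here to avoid extra packages.

```r
set.seed(1)
# Synthetic right-skewed values standing in for sulphates (not the wine data)
x <- rlnorm(1000, meanlog = -0.45, sdlog = 0.5)
x_log <- log10(x)

# Simple moment-based sample skewness (hypothetical helper)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Skewness and variance before and after the transformation
c(raw = skewness(x), log10 = skewness(x_log))
c(raw = var(x), log10 = var(x_log))
```

Because the variance is expressed in different units before and after the transformation, skewness is arguably the more direct check of how symmetric the distribution has become.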
## nbr.val nbr.null nbr.na min max
## 1.534000e+03 0.000000e+00 0.000000e+00 4.600000e+00 1.320000e+01
## range sum median mean SE.mean
## 8.600000e+00 1.266870e+04 7.900000e+00 8.258605e+00 4.167883e-02
## CI.mean.0.95 var std.dev coef.var
## 8.175356e-02 2.664750e+00 1.632406e+00 1.976612e-01
## nbr.val nbr.null nbr.na min max
## 1.534000e+03 0.000000e+00 0.000000e+00 6.627578e-01 1.120574e+00
## range sum median mean SE.mean
## 4.578161e-01 1.394122e+03 8.976271e-01 9.088150e-01 2.124008e-03
## CI.mean.0.95 var std.dev coef.var
## 4.166269e-03 6.920504e-03 8.318956e-02 9.153630e-02
Fixed acidity appears to be long-tailed too, and taking its log10 brings it closer to a normal distribution. The summary statistics confirm a marked decrease in variance, from 2.66 to 0.0069.
## nbr.val nbr.null nbr.na min max
## 1.534000e+03 0.000000e+00 0.000000e+00 1.200000e-01 1.580000e+00
## range sum median mean SE.mean
## 1.460000e+00 8.111800e+02 5.200000e-01 5.288005e-01 4.533375e-03
## CI.mean.0.95 var std.dev coef.var
## 8.892273e-03 3.152599e-02 1.775556e-01 3.357705e-01
## nbr.val nbr.null nbr.na min max
## 1.534000e+03 3.000000e+00 0.000000e+00 -9.208188e-01 1.986571e-01
## range sum median mean SE.mean
## 1.119476e+00 -4.633255e+02 -2.839967e-01 -3.020375e-01 3.882498e-03
## CI.mean.0.95 var std.dev coef.var
## 7.615568e-03 2.312319e-02 1.520631e-01 -5.034577e-01
Volatile acidity also appears to be long-tailed, and taking its log10 brings it closer to a normal distribution, like the variables above. Since pH is a logarithmic measure and is approximately normal, it makes sense for the log of the acidity levels to be approximately normal as well. The variance decreases only modestly here, from 0.032 to 0.023.
## nbr.val nbr.null nbr.na min max
## 1.534000e+03 0.000000e+00 0.000000e+00 6.000000e+00 1.440000e+02
## range sum median mean SE.mean
## 1.380000e+02 6.829000e+04 3.700000e+01 4.451760e+01 7.624117e-01
## CI.mean.0.95 var std.dev coef.var
## 1.495480e+00 8.916706e+02 2.986085e+01 6.707651e-01
## nbr.val nbr.null nbr.na min max
## 1.534000e+03 0.000000e+00 0.000000e+00 7.781513e-01 2.158362e+00
## range sum median mean SE.mean
## 1.380211e+00 2.379277e+03 1.568202e+00 1.551028e+00 7.633760e-03
## CI.mean.0.95 var std.dev coef.var
## 1.497372e-02 8.939277e-02 2.989862e-01 1.927665e-01
The long-tailed total sulfur dioxide data were log10-transformed to reduce skew, which produces a roughly normal distribution. The variance decreases substantially, from 891.7 to 0.089.
What about quality?
As noted earlier, most ratings are 5 and 6. To get a more informative histogram, we can group the ratings into categories as follows:
## bad average excellent
## 62 1264 208
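A grouping of this kind can be built with `cut()`. The sketch below runs on a toy vector and assumes the thresholds used in this report (bad below 5, average for 5 and 6, excellent for 7 and 8); `quality_demo` and `rating` are illustrative names.

```r
# Toy ratings (made-up values); if quality is stored as a factor,
# convert it first with as.numeric(as.character(quality))
quality_demo <- c(3, 4, 5, 6, 7, 8, 5, 6)

# right = FALSE makes the intervals closed on the left: [0,5), [5,7), [7,9)
rating <- cut(quality_demo,
              breaks = c(0, 5, 7, 9),
              labels = c("bad", "average", "excellent"),
              right = FALSE)

table(rating)
```

Setting `right = FALSE` matters here: with the default right-closed intervals, a rating of 5 would fall into the "bad" group instead of "average".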
There are 1599 observations in total and 1534 observations after removing the top 1% from the variables that had large outliers.
Quality and alcohol are the main features.
I expect residual sugar and alcohol to play a main role in wine quality and taste.
Yes, a three-level rating variable derived from quality: bad (quality < 5), average (5 <= quality < 7) and excellent (quality of 7 or 8).
A logarithmic transformation was applied to the following features: (1) sulphates, (2) fixed acidity, (3) total/free sulfur dioxide.
The top 1% of values was removed for some features, such as fixed acidity, residual sugar, total sulfur dioxide and free sulfur dioxide.
The first column was removed because it was an index for the observations.
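Removing the top 1% of values can be done by comparing each column with its 99th percentile. The following is a sketch on a synthetic data frame whose column names mirror the wine data but whose values are made up. `wine_demo`, `trim_cols` and `wine_trimmed` are hypothetical names, and the exact cut-off rule used for this report may differ.

```r
set.seed(42)
# Synthetic data frame (made-up values, not the wine data)
wine_demo <- data.frame(
  residual.sugar       = rlnorm(500, meanlog = 0.8, sdlog = 0.4),
  total.sulfur.dioxide = rlnorm(500, meanlog = 3.6, sdlog = 0.7)
)

# Keep rows at or below the 99th percentile of every listed column
trim_cols <- c("residual.sugar", "total.sulfur.dioxide")
keep <- rep(TRUE, nrow(wine_demo))
for (col in trim_cols) {
  keep <- keep & wine_demo[[col]] <= quantile(wine_demo[[col]], 0.99)
}
wine_trimmed <- wine_demo[keep, ]

c(before = nrow(wine_demo), after = nrow(wine_trimmed))
```

Each percentile is computed on the full data before any rows are dropped, so the order of the columns does not affect the result.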
Now we’ll create a correlation matrix to see the correlations between pairs of variables.
First, the relationship between fixed acidity and pH, where the correlation coefficient is about -0.68:
## [1] -0.6794406
Now the correlation between citric acid and pH (-0.53):
## [1] -0.5283267
The correlation coefficient between volatile acidity and pH is 0.24:
## [1] 0.2387919
The correlation coefficient between volatile acidity and citric acid is -0.56
## [1] -0.5629224
The correlation coefficient between alcohol and pH is 0.22:
## [1] 0.2166557
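Single coefficients like the ones printed above come from `cor()`, which computes the Pearson correlation by default. The sketch below uses synthetic vectors, not the wine data, in which pH is constructed to fall as acidity rises; `fixed_acidity` and `pH_demo` are illustrative names.

```r
set.seed(7)
# Synthetic stand-ins: pH decreases with acidity, plus noise (made-up data)
fixed_acidity <- rnorm(300, mean = 8.3, sd = 1.6)
pH_demo <- 3.9 - 0.07 * fixed_acidity + rnorm(300, sd = 0.1)

# Pearson correlation between the two vectors
r <- cor(fixed_acidity, pH_demo)
round(r, 2)

# cor.test() adds a 95% confidence interval for the same coefficient
ci <- cor.test(fixed_acidity, pH_demo)$conf.int
ci
```

Rounding with `round(r, 2)` rather than truncating keeps the quoted values consistent with the printed output; for example, -0.6794 rounds to -0.68.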
Some interesting points: Lower pH indicates higher acidity. As citric acid increases, sulphates tend to increase as well. The correlation between volatile acidity and citric acid is negative. The correlation between citric acid and pH is negative. pH and alcohol are only weakly correlated.
Volatile acidity and citric acid were negatively correlated, as were citric acid and pH. Fixed acidity and pH were negatively correlated, since more acidic wines have a lower pH.
Citric acid and volatile acidity had a correlation coefficient of -0.563.
Next, we use scatterplots to look at the relationship between pairs of variables by quality, starting with citric acid and alcohol.
Citric acid and alcohol
Now with sulphates and alcohol
Most bad wines seem to have higher levels of volatile acidity, and most excellent wines have lower levels of volatile acidity.
## redWine$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5800 0.6800 0.7306 0.8838 1.5800
## --------------------------------------------------------
## redWine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## --------------------------------------------------------
## redWine$rating: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3100 0.3700 0.4090 0.4925 0.9150
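Per-group summaries in this format are what `by()` prints when it is given a numeric vector, a grouping factor and `summary`. This is a sketch on a toy data frame with made-up values; `demo` and `group_medians` are hypothetical names.

```r
# Toy data (made-up values, not the wine data)
demo <- data.frame(
  volatile.acidity = c(0.90, 0.70, 0.55, 0.50, 0.60, 0.35, 0.30),
  rating = factor(c("bad", "bad", "average", "average", "average",
                    "excellent", "excellent"),
                  levels = c("bad", "average", "excellent"))
)

# One summary() per rating level
by(demo$volatile.acidity, demo$rating, summary)

# tapply() returns just the group medians, which are easier to compare
group_medians <- tapply(demo$volatile.acidity, demo$rating, median)
group_medians
```

In the output above, the median volatile acidity falls from 0.68 for bad wines to 0.54 for average and 0.37 for excellent wines, which is the pattern described in the text.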
Looking at volatile acidity together with sulphates, excellent wines tend to have lower volatile acidity and higher sulphates content, while bad wines tend to have higher volatile acidity and lower sulphates content.
The next plot is interesting. In this sample, the probability of a wine being excellent is zero when volatile acidity is greater than 1. When volatile acidity is around 0 or around 0.3, there is roughly a 40% probability that the wine is excellent. When volatile acidity is between 1 and 1.2, there is an 80% chance that the wine is bad. Any wine with a volatile acidity greater than 1.4 has a 100% chance of being bad.
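Probabilities like these can be estimated by binning volatile acidity and taking the share of each rating within every bin. The sketch below uses synthetic data built so that higher volatile acidity makes a bad rating more likely. The bin width, the simulation and the names `va`, `va_bin` and `cond_prob` are assumptions for illustration, not the method behind the plot.

```r
set.seed(3)
# Synthetic data (made-up): "bad" becomes more likely as va increases
va <- runif(400, 0.1, 1.6)
p_bad <- plogis(6 * (va - 1.1))
rating <- ifelse(runif(400) < p_bad, "bad",
                 ifelse(runif(400) < 0.3 * (va < 0.9), "excellent", "average"))
rating <- factor(rating, levels = c("bad", "average", "excellent"))

# Bin volatile acidity, then take row-wise proportions: P(rating | bin)
va_bin <- cut(va, breaks = seq(0, 1.6, by = 0.4))
cond_prob <- prop.table(table(va_bin, rating), margin = 1)
round(cond_prob, 2)

# Uncomment to draw a smoothed version with base graphics:
# cdplot(rating ~ va)
```

Each row of `cond_prob` sums to 1, so a value can be read directly as the estimated chance of a rating given that volatile acidity falls in that bin. Bins with few wines give unstable estimates, which is worth keeping in mind for statements such as a 100% chance.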
Some interesting points:
Most bad wines seem to have higher levels of volatile acidity. Most excellent wines have lower levels of volatile acidity. Excellent wines have lower volatile acidity and higher sulphates content. Bad wines have higher volatile acidity and lower sulphates content.
## redWine$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.20 10.97 13.10
## --------------------------------------------------------
## redWine$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.50 10.00 10.26 10.90 14.00
## --------------------------------------------------------
## redWine$rating: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.50 10.80 11.60 11.54 12.22 14.00
This graph shows the relationship between pH and quality: quality tends to increase as pH decreases and to decrease as pH increases.
In this sample, the probability of a wine being excellent is zero when volatile acidity is greater than 1. When volatile acidity is around 0 or around 0.3, there is roughly a 40% probability that the wine is excellent. When volatile acidity is between 1 and 1.2, there is an 80% chance that the wine is bad. Any wine with a volatile acidity greater than 1.4 has a 100% chance of being bad.
Bad wines have lower sulphates, with alcohol levels varying between 9% and 12%. Average wines have higher concentrations of sulphates. Wines rated 6 tend to have higher alcohol content and higher sulphates content.
This graph makes it fairly clear that both sulphates and alcohol content contribute to quality.
The data set contains information on 1,599 red wines. Given the large number of chemical variables, I assumed that some of them would be related to each other, and this turned out to be true: for example, pH was negatively correlated with fixed acidity, which makes sense. Alcohol level also appeared to be the most important variable for identifying high-quality wine. Large amounts of volatile acidity made a wine bad regardless of the other variables, which makes sense because large amounts of acetic acid create an unpleasant, vinegar-like taste.
One weakness of this data is possible bias from the wine tasters’ preferences. Because the tasters are experts, they tend to look for more advanced characteristics in a wine than the average person would.
The best part of this project, and for me the main success, was exploring and to some extent predicting wine quality from a few technical variables without actually tasting the wine. Just by exploring the data, anyone can identify basic trends. I struggled because of my lack of experience with the components of wine and what they mean, and also with choosing the most appropriate graph for each context during the analysis.
In future work, expert reviews could be added to improve the dataset. Feedback from reviewers explaining how they rate a wine could add value to the analysis.